Naive-Bayes for Sentiment Classification

Author

  • David Sun
Abstract

This report details the findings from building a naive Bayes sentiment classifier for an IMDB movie-review data set using Scala and ScalaNLP. We studied the unigram (bag-of-words) Bernoulli and multinomial models and a number of feature selection techniques, including term frequency, mutual information and chi-squared.

1. DATA CORPUS

The corpus consists of 2000 rated movie reviews, with an equal number of reviews for each sentiment group (positive and negative). Without stemming or stop-word removal there are a total of 1293948 words and 48813 unique words in the dictionary. On average, each document in the positive sentiment corpus contains 683 words, 350 of which are unique, while each document in the negative sentiment corpus contains 610 words, 325 of which are unique. Given the relatively long documents, we expect a priori that the multinomial model will produce more accurate results than the Bernoulli model. Note that the data was already partitioned by sentiment into positive and negative 'folders', so no explicit tagging was required.

2. MODELS

Unigram (bag-of-words) multinomial and Bernoulli models were considered. Bigrams will be explored as future work. Naive Bayes (NB) makes the fundamental assumption that the words in a document are conditionally independent given the class. Thus, the joint likelihood of observing a sequence of terms conditioned on the class c decouples easily: P(t1, t2, ..., tn | c) = ∏ …
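Under the conditional-independence assumption stated above, the likelihood factors into per-term probabilities, P(t1, ..., tn | c) = ∏_i P(ti | c), and a document is assigned to the class that maximizes P(c) ∏_i P(ti | c). A minimal sketch of a multinomial classifier built on that factorization is given here, written in Python for brevity rather than the Scala and ScalaNLP used in the report. The function names, the add-one (Laplace) smoothing, and the toy reviews are assumptions for illustration; they are not the report's implementation or data.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents.

    alpha is an assumed add-alpha (Laplace) smoothing constant.
    """
    vocab = {t for doc in docs for t in doc}
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        # Multinomial model: count every occurrence of a term, not just presence.
        counts = Counter(t for d in class_docs for t in d)
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return log_prior, log_lik


def classify(tokens, log_prior, log_lik):
    """Return argmax_c of log P(c) + sum_i log P(t_i | c).

    Terms outside the training vocabulary are skipped.
    """
    scores = {
        c: log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy reviews, invented for this sketch.
train_docs = [
    "a great and moving film".split(),
    "great acting and a great script".split(),
    "a dull and boring film".split(),
    "boring script and dull acting".split(),
]
train_labels = ["pos", "pos", "neg", "neg"]

log_prior, log_lik = train_multinomial_nb(train_docs, train_labels)
prediction = classify("a great script".split(), log_prior, log_lik)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for documents of several hundred words such as those in this corpus. A Bernoulli variant would instead model the presence or absence of each vocabulary term in a document.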

Similar resources

A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier

With the rapid growth in the number of documents, Text Document Classification (TDC) methods have become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and a Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large size of the feature space in TDC. TDC includes different actions such as text processing, feature extraction, form...

Textual sentiment summarization

Naive Bayes. The multinomial Naive Bayes model on a dictionary is a familiar option for text classification, e.g. (Gale, Church, & Yarowsky 1992), (McCallum & Nigam 1998). When there are additional features, the Naive Bayes model also has a natural extension: we simply assume that each additional feature is independent of all the others, conditional upon . In this case, we invert Bayes’ Law by o...

Fast and Accurate Sentiment Classification Using an Enhanced Naive Bayes Model

We have explored different methods of improving the accuracy of a Naive Bayes classifier for sentiment analysis. We observed that a combination of methods like effective negation handling, word n-grams and feature selection by mutual information results in a significant improvement in accuracy. This implies that a highly accurate and fast sentiment classifier can be built using a simple Naive B...

Sentiment Analysis using Naive Bayes

Sentiment analysis is a challenging and interesting natural language processing task, if only because it naturally lends itself to domain adaptation. We study sentiment analysis using Naive Bayes, essentially reproducing the results from [1]. We start by describing the Naive Bayes model we use, then we describe the experimental setup, and finally we discuss our observations and results. The N...

SWASH: A Naive Bayes Classifier for Tweet Sentiment Identification

This paper describes a sentiment classification system designed for SemEval-2015, Task 10, Subtask B. The system employs a constrained, supervised text categorization approach. First, since thorough preprocessing of tweet data was shown to be effective in previous SemEval sentiment classification tasks, various preprocessing steps were introduced to enhance the quality of lexical informati...

Sentiment Classification of Movie Reviews Using Hybrid Method

The area of sentiment mining (also called sentiment extraction, opinion mining, opinion extraction, sentiment analysis, etc.) has seen a large increase in academic interest in the last few years. Researchers in natural language processing, data mining, machine learning, and other areas have tested a variety of methods for automating the sentiment analysis process. In this research work, ...

Publication date: 2012